Towards Scalable Backpropagation-Free Gradient Estimation
Wang, Daniel, Markou, Evan, Campbell, Dylan
While backpropagation (reverse-mode automatic differentiation) has been extraordinarily successful in deep learning, it requires two passes through the neural network (forward and backward) and the storage of intermediate activations. Existing gradient estimation methods that instead use forward-mode automatic differentiation struggle to scale beyond small networks due to the high variance of their estimates. Efforts to mitigate this have so far introduced significant bias into the estimates, reducing their utility. We introduce a gradient estimation approach that reduces both bias and variance by manipulating upstream Jacobian matrices when computing guess directions. It shows promising results and has the potential to scale to larger networks, indeed performing better as the network width is increased. Our understanding of this method is aided by analyses of its bias and variance, and of their connection to the low-dimensional structure of neural network gradients.
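The baseline this abstract builds on, the forward-mode "guess direction" estimator, can be sketched in a few lines. The quadratic objective, dimensions, and sample counts below are illustrative assumptions, and `grad_f` stands in for a forward-mode Jacobian-vector product (in practice the directional derivative is computed in a single forward pass, without materializing the gradient):

```python
import numpy as np

# Forward-gradient sketch: sample a random guess direction v, compute the
# directional derivative d = (grad f) . v via forward-mode AD (no backward
# pass, no stored activations), and use d * v as the gradient estimate.
# Since E[v v^T] = I for v ~ N(0, I), the estimator is unbiased, but its
# variance grows with dimension -- the scaling problem discussed above.

def forward_gradient(grad_f, theta, rng):
    v = rng.standard_normal(theta.shape)   # random guess direction
    d = grad_f(theta) @ v                  # directional derivative (a JVP)
    return d * v                           # unbiased single-sample estimate

rng = np.random.default_rng(0)
theta = rng.standard_normal(50)
grad_f = lambda t: t                       # gradient of f(x) = ||x||^2 / 2

# Averaging many single-sample estimates recovers the true gradient.
est = np.mean([forward_gradient(grad_f, theta, rng) for _ in range(20000)],
              axis=0)
print(np.max(np.abs(est - theta)))         # small residual error
```

The per-sample variance of each component scales with the squared norm of the full gradient, which is why such estimators degrade as networks grow and why the bias/variance reductions proposed above matter.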
Growing with Experience: Growing Neural Networks in Deep Reinforcement Learning
Fehring, Lukas, Lindauer, Marius, Eimer, Theresa
While increasingly large models have revolutionized much of the machine learning landscape, training even mid-sized networks for Reinforcement Learning (RL) remains a struggle. This severely limits the complexity of the policies we are able to learn. To enable increased network capacity while maintaining trainability, we propose GrowNN, a simple yet effective method that grows the network progressively during training. We start by training a small network to learn an initial policy. Then we add layers without changing the encoded function. Subsequent updates can use the added layers to learn a more expressive policy, adding capacity as the policy's complexity increases. GrowNN can be seamlessly integrated into most existing RL agents. Our experiments on MiniHack and MuJoCo show improved agent performance, with incrementally GrowNN-deepened networks outperforming their static counterparts of the same final size by up to 48% on MiniHack Room and 72% on Ant.
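The key mechanism, adding layers "without changing the encoded function", can be sketched with an identity-initialized layer, as in function-preserving deepening schemes such as Net2DeeperNet. The layer representation and the ReLU choice below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

# Function-preserving growth sketch: insert a new hidden layer whose
# weights are the identity and whose biases are zero. Because
# ReLU(ReLU(h) @ I) == ReLU(h), the grown network computes exactly the
# same function as before, and later updates can use the new capacity.

def forward(layers, x):
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:            # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return x

def grow_identity(layers, position):
    """Insert an identity-initialized layer after layer `position`."""
    width = layers[position][0].shape[1]
    new_layer = (np.eye(width), np.zeros(width))
    return layers[:position + 1] + [new_layer] + layers[position + 1:]

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 8)), np.zeros(8)),
          (rng.standard_normal((8, 2)), np.zeros(2))]
x = rng.standard_normal((5, 4))

before = forward(layers, x)
after = forward(grow_identity(layers, 0), x)
print(np.allclose(before, after))          # growth preserves the policy
```

Preserving the function at growth time is what keeps training stable: the agent's behavior does not jump when capacity is added.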
Review for NeurIPS paper: GradAug: A New Regularization Method for Deep Neural Networks
Summary and Contributions: After the rebuttal and discussion with the other reviewers, I have updated my score. However, I note several remaining concerns that the authors could address with further validation. It is good that the authors performed the time/memory comparison in the rebuttal, as that was a significant concern of mine. My remaining concerns mostly revolve around which other techniques this should be compared against. Given that the algorithm takes 3-4x as long as the baseline, one could, for example: (1) train a much larger network and then use compression techniques to slim it to the same size; or (2) use Mixup, which is still 70% faster.
Dissecting a Small Artificial Neural Network
Yang, Xiguang, Arora, Krish, Bachmann, Michael
We investigate the loss landscape and backpropagation dynamics of convergence for the simplest possible artificial neural network representing the logical exclusive-OR (XOR) gate. Cross-sections of the loss landscape in the nine-dimensional parameter space are found to exhibit distinct features, which help to explain why backpropagation efficiently achieves convergence toward zero loss while the values of the weights and biases keep drifting. Differences in the shapes of cross-sections obtained with nonrandomized and randomized batches are discussed. With reference to statistical physics, we introduce the microcanonical entropy as a unique quantity that allows us to characterize the phase behavior of the network. Learning in neural networks can thus be thought of as an annealing process that experiences the analogue of phase transitions known from thermodynamic systems. The analysis also reveals how the loss landscape simplifies as more hidden neurons are added to the network, eliminating entropic barriers caused by finite-size effects.
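The nine-dimensional parameter space mentioned above corresponds to a 2-2-1 network: a 2x2 hidden weight matrix, 2 hidden biases, 2 output weights, and 1 output bias. A minimal sketch of such a network trained by backpropagation on XOR, with illustrative (not the paper's) activation and learning-rate choices:

```python
import numpy as np

# 2-2-1 XOR network: 4 + 2 + 2 + 1 = 9 parameters, trained by plain
# gradient descent with hand-written backpropagation. tanh hidden units,
# sigmoid output, and mean-squared-error loss are illustrative choices.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])               # XOR truth table

rng = np.random.default_rng(3)
W1, b1 = rng.standard_normal((2, 2)), np.zeros((1, 2))
W2, b2 = rng.standard_normal((2, 1)), np.zeros((1, 1))
n_params = W1.size + b1.size + W2.size + b2.size      # = 9

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
losses, lr = [], 0.5
for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                          # forward pass
    p = sigmoid(h @ W2 + b2)
    losses.append(np.mean((p - y) ** 2))
    dp = 2 * (p - y) / len(X) * p * (1 - p)           # backward pass
    dh = (dp @ W2.T) * (1 - h ** 2)
    W2 -= lr * (h.T @ dp); b2 -= lr * dp.sum(0)
    W1 -= lr * (X.T @ dh); b1 -= lr * dh.sum(0)

print(n_params, losses[0], losses[-1])                # loss decreases
```

Tracking the nine parameters during such a run is what exposes the drift of weights and biases even after the loss has essentially converged.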
Adaptive Neural Networks Using Residual Fitting
Ford, Noah, Winder, John, McClellan, Josh
Current methods for estimating the required neural-network size for a given problem class, such as neural-architecture search and pruning, can be computationally intensive. In contrast, methods that add capacity to neural networks as needed may provide similar results to architecture search and pruning without requiring as much computation to find an appropriate network size. Here, we present a network-growth method that searches for explainable error in the network's residuals and grows the network if sufficient error is detected. We demonstrate this method on examples from classification, imitation learning, and reinforcement learning. Within these tasks, the growing network often achieves better performance than small networks that do not grow, and similar performance to networks that begin much larger.
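The residual-fitting idea can be sketched abstractly: fit a small model, then fit a second model to its residuals; if the residuals contain explainable structure (i.e., the second fit reduces error), add the residual predictor as new capacity. The polynomial-feature regressors below are a stand-in for small networks, and all modeling choices are illustrative assumptions:

```python
import numpy as np

# Residual fitting as a growth signal: the base model's residuals are
# checked for explainable error by fitting a second model to them. If the
# residual fit helps, the grown model is the sum of the two predictors.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * X[:, 0] ** 2          # toy regression target

def fit_features(X, y, degree):
    """Least-squares fit on polynomial features (stand-in for a small net)."""
    Phi = np.hstack([X ** d for d in range(degree + 1)])
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return lambda Xn: np.hstack([Xn ** d for d in range(degree + 1)]) @ w

base = fit_features(X, y, degree=1)                   # small initial model
residual = y - base(X)                                # what the model misses

grown = fit_features(X, residual, degree=5)           # fit explainable error
combined = lambda Xn: base(Xn) + grown(Xn)            # grown model

mse_base = np.mean((y - base(X)) ** 2)
mse_grown = np.mean((y - combined(X)) ** 2)
print(mse_base, mse_grown)                            # growth reduces error
```

When the residual fit fails to reduce error, the residuals are effectively noise and growth can be skipped, which is what keeps the search cheap relative to architecture search or pruning.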
Regularizing Deep Neural Networks
Let's discuss regularizing deep neural networks. Deep neural networks with a large number of parameters are very powerful machine learning systems. However, overfitting can be a significant problem in such networks. Large networks are also slow to use, which makes it hard to address overfitting by combining the predictions of many different large neural networks at test time. Dropout is a technique for addressing this problem.
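A minimal sketch of dropout, in the "inverted" formulation used by most modern implementations (survivors are scaled at training time so no rescaling is needed at test time); the shapes and drop rate are illustrative:

```python
import numpy as np

# Inverted dropout: during training, each unit is zeroed with probability
# p and survivors are scaled by 1/(1 - p), so the expected activation
# matches test time, when the full network is used unscaled. This cheaply
# approximates averaging the predictions of many "thinned" networks.

def dropout(h, p, rng, training=True):
    if not training:
        return h                         # test time: all units, no scaling
    mask = rng.random(h.shape) >= p      # keep each unit with prob 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones((100000, 1))                 # toy activations
out = dropout(h, p=0.5, rng=rng)

print(out.mean())                        # close to 1.0: mean is preserved
```

Each training step thus samples a different thinned sub-network, which is what discourages units from co-adapting and reduces overfitting.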